Data analyst vs Data scientist

    “A data scientist is someone who can predict the future based on past patterns whereas a data analyst is someone who merely curates meaningful insights from data.”
    “A data scientist is expected to generate their own questions while a data analyst finds answers to a given set of questions from data.” (https://www.projectpro.io/article/difference-between-data-analyst-and-data-scientist/332)                 

                                                  Įvadas
Darbui duomenys buvo paimti iš Kaggle puslapio:
    https://www.kaggle.com/andrewmvd/data-scientist-jobs
    https://www.kaggle.com/andrewmvd/data-analyst-jobs

Duomenų bazės atvaizduoja 2020-07 mėn. darbo skelbimus į duomenų mokslininko ir duomenų analitiko pozicijas iš JAV puslapio: 
    https://www.glassdoor.com/member/home/index.htm

Darbo tikslai:
    1. Palyginti specialistų atlyginimus.
    2. Kompiuterinių įgūdžių skirtumai/panašumai.
    3. Kokiose darbo sektoriuose reikalingi šie specialistai? Atlyginimai pagal sektorius. TOP10 įmonių ir TOP10 miestų.        

Duomenų paruošimas¶

In [259]:
# Importuojame reikalingus modulius
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Duomenų skaitymas ir greita apžvalga¶

In [2]:
analyst = pd.read_csv('C:\\...\\DATA_specialists\\DataAnalyst.csv')
analyst.head(3)
Out[2]:
Unnamed: 0 Job Title Salary Estimate Job Description Rating Company Name Location Headquarters Size Founded Type of ownership Industry Sector Revenue Competitors Easy Apply
0 0 Data Analyst, Center on Immigration and Justic... $37K-$66K (Glassdoor est.) Are you eager to roll up your sleeves and harn... 3.2 Vera Institute of Justice\n3.2 New York, NY New York, NY 201 to 500 employees 1961 Nonprofit Organization Social Assistance Non-Profit $100 to $500 million (USD) -1 TRUE
1 1 Quality Data Analyst $37K-$66K (Glassdoor est.) Overview\n\nProvides analytical and technical ... 3.8 Visiting Nurse Service of New York\n3.8 New York, NY New York, NY 10000+ employees 1893 Nonprofit Organization Health Care Services & Hospitals Health Care $2 to $5 billion (USD) -1 -1
2 2 Senior Data Analyst, Insights & Analytics Team... $37K-$66K (Glassdoor est.) We’re looking for a Senior Data Analyst who ha... 3.4 Squarespace\n3.4 New York, NY New York, NY 1001 to 5000 employees 2003 Company - Private Internet Information Technology Unknown / Non-Applicable GoDaddy -1
In [3]:
analyst.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2253 entries, 0 to 2252
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         2253 non-null   int64  
 1   Job Title          2253 non-null   object 
 2   Salary Estimate    2253 non-null   object 
 3   Job Description    2253 non-null   object 
 4   Rating             2253 non-null   float64
 5   Company Name       2252 non-null   object 
 6   Location           2253 non-null   object 
 7   Headquarters       2253 non-null   object 
 8   Size               2253 non-null   object 
 9   Founded            2253 non-null   int64  
 10  Type of ownership  2253 non-null   object 
 11  Industry           2253 non-null   object 
 12  Sector             2253 non-null   object 
 13  Revenue            2253 non-null   object 
 14  Competitors        2253 non-null   object 
 15  Easy Apply         2253 non-null   object 
dtypes: float64(1), int64(2), object(13)
memory usage: 281.8+ KB
In [4]:
scientist = pd.read_csv('C:\\...\\DATA_specialists\\DataScientist.csv')
scientist.head(3)
Out[4]:
Unnamed: 0 Job Title Salary Estimate Job Description Rating Company Name Location Headquarters Size Founded Type of ownership Industry Sector Revenue Competitors Easy Apply
0 0 Senior Data Scientist $111K-$181K (Glassdoor est.) ABOUT HOPPER\n\nAt Hopper, we’re on a mission ... 3.5 Hopper\n3.5 New York, NY Montreal, Canada 501 to 1000 employees 2007 Company - Private Travel Agencies Travel & Tourism Unknown / Non-Applicable -1 -1
1 1 Data Scientist, Product Analytics $111K-$181K (Glassdoor est.) At Noom, we use scientifically proven methods ... 4.5 Noom US\n4.5 New York, NY New York, NY 1001 to 5000 employees 2008 Company - Private Health, Beauty, & Fitness Consumer Services Unknown / Non-Applicable -1 -1
2 2 Data Science Manager $111K-$181K (Glassdoor est.) Decode_M\n\nhttps://www.decode-m.com/\n\nData ... -1.0 Decode_M New York, NY New York, NY 1 to 50 employees -1 Unknown -1 -1 Unknown / Non-Applicable -1 TRUE
In [5]:
scientist.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3909 entries, 0 to 3908
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         3909 non-null   int64  
 1   Job Title          3909 non-null   object 
 2   Salary Estimate    3909 non-null   object 
 3   Job Description    3909 non-null   object 
 4   Rating             3909 non-null   float64
 5   Company Name       3909 non-null   object 
 6   Location           3909 non-null   object 
 7   Headquarters       3909 non-null   object 
 8   Size               3909 non-null   object 
 9   Founded            3909 non-null   int64  
 10  Type of ownership  3909 non-null   object 
 11  Industry           3909 non-null   object 
 12  Sector             3909 non-null   object 
 13  Revenue            3909 non-null   object 
 14  Competitors        3909 non-null   object 
 15  Easy Apply         3909 non-null   object 
dtypes: float64(1), int64(2), object(13)
memory usage: 488.8+ KB
Po greitosios duomenų apžvalgos, matome, kad 
  • Lentelėse yra keli stulpeliai su indeksais,
  • Atlyginimų stulpelis reikalauja valymo - pašalinti tekstines reikšmes, bei simbolius,
  • Įmonių pavadinimuose yra papildomi simboliai ir skaičiaiiš stulpelio 'Rating',
  • Vietoje NaN reikšmių yra -1.

Duomenų apdorojimas¶

Lentelių apjungimas¶

In [6]:
# Kadangi įkeliant duomenis lentelėse atsirado indeksų dublikatai, pašalinkime po vieną stulpelį kiekvienoje lentelėje
analyst.drop(analyst.columns[[0]], axis=1, inplace=True)
analyst.head(3)
Out[6]:
Job Title Salary Estimate Job Description Rating Company Name Location Headquarters Size Founded Type of ownership Industry Sector Revenue Competitors Easy Apply
0 Data Analyst, Center on Immigration and Justic... $37K-$66K (Glassdoor est.) Are you eager to roll up your sleeves and harn... 3.2 Vera Institute of Justice\n3.2 New York, NY New York, NY 201 to 500 employees 1961 Nonprofit Organization Social Assistance Non-Profit $100 to $500 million (USD) -1 TRUE
1 Quality Data Analyst $37K-$66K (Glassdoor est.) Overview\n\nProvides analytical and technical ... 3.8 Visiting Nurse Service of New York\n3.8 New York, NY New York, NY 10000+ employees 1893 Nonprofit Organization Health Care Services & Hospitals Health Care $2 to $5 billion (USD) -1 -1
2 Senior Data Analyst, Insights & Analytics Team... $37K-$66K (Glassdoor est.) We’re looking for a Senior Data Analyst who ha... 3.4 Squarespace\n3.4 New York, NY New York, NY 1001 to 5000 employees 2003 Company - Private Internet Information Technology Unknown / Non-Applicable GoDaddy -1
In [7]:
scientist.drop(scientist.columns[[0]], axis=1, inplace=True)
scientist.head(3)
Out[7]:
Job Title Salary Estimate Job Description Rating Company Name Location Headquarters Size Founded Type of ownership Industry Sector Revenue Competitors Easy Apply
0 Senior Data Scientist $111K-$181K (Glassdoor est.) ABOUT HOPPER\n\nAt Hopper, we’re on a mission ... 3.5 Hopper\n3.5 New York, NY Montreal, Canada 501 to 1000 employees 2007 Company - Private Travel Agencies Travel & Tourism Unknown / Non-Applicable -1 -1
1 Data Scientist, Product Analytics $111K-$181K (Glassdoor est.) At Noom, we use scientifically proven methods ... 4.5 Noom US\n4.5 New York, NY New York, NY 1001 to 5000 employees 2008 Company - Private Health, Beauty, & Fitness Consumer Services Unknown / Non-Applicable -1 -1
2 Data Science Manager $111K-$181K (Glassdoor est.) Decode_M\n\nhttps://www.decode-m.com/\n\nData ... -1.0 Decode_M New York, NY New York, NY 1 to 50 employees -1 Unknown -1 -1 Unknown / Non-Applicable -1 TRUE
In [8]:
# Apjunkime lenteles concat f-jos pagalba į bendrą lentelę 'specialists' ir sukūrkime naują stulpelį 'specialist', kuriame
# priskirkime duomenims iš lentelės 'analyst' reikšmes 'Analyst', o iš lentelės 'scientist' reikšmes 'Scientist'
analyst['specialist'] = 'Analyst'
scientist['specialist'] = 'Scientist'
specialists = pd.concat([analyst, scientist], ignore_index=True)
specialists
Out[8]:
Job Title Salary Estimate Job Description Rating Company Name Location Headquarters Size Founded Type of ownership Industry Sector Revenue Competitors Easy Apply specialist
0 Data Analyst, Center on Immigration and Justic... $37K-$66K (Glassdoor est.) Are you eager to roll up your sleeves and harn... 3.2 Vera Institute of Justice\n3.2 New York, NY New York, NY 201 to 500 employees 1961 Nonprofit Organization Social Assistance Non-Profit $100 to $500 million (USD) -1 TRUE Analyst
1 Quality Data Analyst $37K-$66K (Glassdoor est.) Overview\n\nProvides analytical and technical ... 3.8 Visiting Nurse Service of New York\n3.8 New York, NY New York, NY 10000+ employees 1893 Nonprofit Organization Health Care Services & Hospitals Health Care $2 to $5 billion (USD) -1 -1 Analyst
2 Senior Data Analyst, Insights & Analytics Team... $37K-$66K (Glassdoor est.) We’re looking for a Senior Data Analyst who ha... 3.4 Squarespace\n3.4 New York, NY New York, NY 1001 to 5000 employees 2003 Company - Private Internet Information Technology Unknown / Non-Applicable GoDaddy -1 Analyst
3 Data Analyst $37K-$66K (Glassdoor est.) Requisition NumberRR-0001939\nRemote:Yes\nWe c... 4.1 Celerity\n4.1 New York, NY McLean, VA 201 to 500 employees 2002 Subsidiary or Business Segment IT Services Information Technology $50 to $100 million (USD) -1 -1 Analyst
4 Reporting Data Analyst $37K-$66K (Glassdoor est.) ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl... 3.9 FanDuel\n3.9 New York, NY New York, NY 501 to 1000 employees 2009 Company - Private Sports & Recreation Arts, Entertainment & Recreation $100 to $500 million (USD) DraftKings TRUE Analyst
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6157 AWS Data Engineer $55K-$112K (Glassdoor est.) About Us\n\nTachyon Technologies is a Digital ... 4.4 Tachyon Technologies\n4.4 Dublin, OH Irving, TX 201 to 500 employees 2011 Company - Private IT Services Information Technology $10 to $25 million (USD) -1 -1 Scientist
6158 Data Analyst â Junior $55K-$112K (Glassdoor est.) Job description\nInterpret data, analyze resul... 5.0 Staffigo Technical Services, LLC\n5.0 Columbus, OH Woodridge, IL 51 to 200 employees 2008 Company - Private IT Services Information Technology $50 to $100 million (USD) -1 -1 Scientist
6159 Security Analytics Data Engineer $55K-$112K (Glassdoor est.) Job DescriptionThe Security Analytics Data Eng... 3.8 PDS Tech, Inc.\n3.8 Dublin, OH Irving, TX 5001 to 10000 employees 1977 Company - Private Staffing & Outsourcing Business Services $100 to $500 million (USD) -1 -1 Scientist
6160 Security Analytics Data Engineer $55K-$112K (Glassdoor est.) The Security Analytics Data Engineer will inte... 4.0 Data Resource Technologies\n4.0 Dublin, OH Omaha, NE 1 to 50 employees -1 Company - Private Accounting Accounting & Legal Less than $1 million (USD) -1 -1 Scientist
6161 Patient Safety Physician or Safety Scientist -... $55K-$112K (Glassdoor est.) Help us transform patients' lives.\nAt UCB, we... 3.7 UCB\n3.7 Slough, OH Brussel, Belgium 5001 to 10000 employees -1 Company - Public Biotech & Pharmaceuticals Biotech & Pharmaceuticals $2 to $5 billion (USD) AbbVie -1 Scientist

6162 rows × 16 columns

Apjungus lenetels apdorokime duomenis sekančia tvarka:
  1. Vietoje žymėjimo tuščių/neužpildytų reikšmių -1 įrašykime NaN reikšmes tolimesnei duomenų analizei
  2. Stulpelyje 'Salary Estimate' atskirkime atlyginimų reikšmes 'min'/'max', bei išveskime vidurkį
  3. Išrinkime stulpelyje 'Job Description' su darbo pozicijomis susijusius įgūdžius sukurdami naujus stulpelius
  4. Stulpelyje 'Company Name' ištrinkime įmonių pavadinimų pabaigoje esančius simbolius (pvz.,'\n4.4')
  5. Atskirkime stulpelyje 'Location' miesto pavadinimą nuo valstijų/valstybių santrumpų
  6. Ištrinkime nereikalingus duomenų analizei stulpelius

NaN reikšmės¶

In [9]:
specialists = specialists.replace(['-1', '-1.0', -1], np.nan)
specialists.head()
Out[9]:
Job Title Salary Estimate Job Description Rating Company Name Location Headquarters Size Founded Type of ownership Industry Sector Revenue Competitors Easy Apply specialist
0 Data Analyst, Center on Immigration and Justic... $37K-$66K (Glassdoor est.) Are you eager to roll up your sleeves and harn... 3.2 Vera Institute of Justice\n3.2 New York, NY New York, NY 201 to 500 employees 1961.0 Nonprofit Organization Social Assistance Non-Profit $100 to $500 million (USD) NaN TRUE Analyst
1 Quality Data Analyst $37K-$66K (Glassdoor est.) Overview\n\nProvides analytical and technical ... 3.8 Visiting Nurse Service of New York\n3.8 New York, NY New York, NY 10000+ employees 1893.0 Nonprofit Organization Health Care Services & Hospitals Health Care $2 to $5 billion (USD) NaN NaN Analyst
2 Senior Data Analyst, Insights & Analytics Team... $37K-$66K (Glassdoor est.) We’re looking for a Senior Data Analyst who ha... 3.4 Squarespace\n3.4 New York, NY New York, NY 1001 to 5000 employees 2003.0 Company - Private Internet Information Technology Unknown / Non-Applicable GoDaddy NaN Analyst
3 Data Analyst $37K-$66K (Glassdoor est.) Requisition NumberRR-0001939\nRemote:Yes\nWe c... 4.1 Celerity\n4.1 New York, NY McLean, VA 201 to 500 employees 2002.0 Subsidiary or Business Segment IT Services Information Technology $50 to $100 million (USD) NaN NaN Analyst
4 Reporting Data Analyst $37K-$66K (Glassdoor est.) ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl... 3.9 FanDuel\n3.9 New York, NY New York, NY 501 to 1000 employees 2009.0 Company - Private Sports & Recreation Arts, Entertainment & Recreation $100 to $500 million (USD) DraftKings TRUE Analyst

Stulpelio 'Salary Estimate' modifikacija¶

In [11]:
# Pažiūrėkime kokie yra skirtingi atlyginimo rėžiai ir jų kiekiai
specialists['Salary Estimate'].value_counts()
Out[11]:
$93K-$151K (Glassdoor est.)         62
$60K-$124K (Glassdoor est.)         62
$113K-$180K (Glassdoor est.)        61
$43K-$76K (Glassdoor est.)          61
$71K-$122K (Glassdoor est.)         60
                                    ..
$10-$26 Per Hour(Glassdoor est.)     3
$36K-$67K (Glassdoor est.)           3
$43K-$77K (Glassdoor est.)           3
$57K-$70K (Glassdoor est.)           2
$41K-$74K (Glassdoor est.)           1
Name: Salary Estimate, Length: 231, dtype: int64
In [12]:
# Patikrinkime ar yra NaN reikšmių šiame stulpelyje
specialists['Salary Estimate'].isna().sum()
Out[12]:
1
In [13]:
# Kadangi stulpelyje 'Salary Estimate' turime tik vieną NaN reikšmę, ištriname eilutę, turinčią šią reikšmę
specialists.dropna(subset=['Salary Estimate'], inplace=True)
In [14]:
# Patikrinkime ar ištrinėme
specialists['Salary Estimate'].isna().sum()
Out[14]:
0
In [15]:
# Sukūrkime naująjį kintamąjį iš 'Salary Estimate', kuriame ištriname skliaustelius, įrašus juose, raidę K ir simbolį '$'
salary = specialists['Salary Estimate'].apply(lambda x : x.split('(Glassdoor est.)')[0]).apply(lambda x : x.replace('K','').replace('$','').replace('Per Hour','').replace('(Employer est.)', ''))
salary
Out[15]:
0        37-66 
1        37-66 
2        37-66 
3        37-66 
4        37-66 
         ...   
6157    55-112 
6158    55-112 
6159    55-112 
6160    55-112 
6161    55-112 
Name: Salary Estimate, Length: 6161, dtype: object
In [16]:
# Sukūrkime stulpelius su minimalaus, maksimalaus ir vidutinio atlyginimo reikšmėmis
specialists['salary_min'] = salary.apply(lambda x : int(x.split('-')[0]))
specialists['salary_max'] = salary.apply(lambda x : int(x.split('-')[1]))
specialists['salary_avg'] = (specialists.salary_max + specialists.salary_min) / 2
#  Patikriname ar susikūrė nauji stulpeliai
specialists.head(3)
Out[16]:
Job Title Salary Estimate Job Description Rating Company Name Location Headquarters Size Founded Type of ownership Industry Sector Revenue Competitors Easy Apply specialist salary_min salary_max salary_avg
0 Data Analyst, Center on Immigration and Justic... $37K-$66K (Glassdoor est.) Are you eager to roll up your sleeves and harn... 3.2 Vera Institute of Justice\n3.2 New York, NY New York, NY 201 to 500 employees 1961.0 Nonprofit Organization Social Assistance Non-Profit $100 to $500 million (USD) NaN TRUE Analyst 37 66 51.5
1 Quality Data Analyst $37K-$66K (Glassdoor est.) Overview\n\nProvides analytical and technical ... 3.8 Visiting Nurse Service of New York\n3.8 New York, NY New York, NY 10000+ employees 1893.0 Nonprofit Organization Health Care Services & Hospitals Health Care $2 to $5 billion (USD) NaN NaN Analyst 37 66 51.5
2 Senior Data Analyst, Insights & Analytics Team... $37K-$66K (Glassdoor est.) We’re looking for a Senior Data Analyst who ha... 3.4 Squarespace\n3.4 New York, NY New York, NY 1001 to 5000 employees 2003.0 Company - Private Internet Information Technology Unknown / Non-Applicable GoDaddy NaN Analyst 37 66 51.5

'Job Description' įgūdžiai¶

Iš interneto platybių Skills DA and DS https://bootcamp.cvn.columbia.edu/blog/data-analyst-skills/ Below, we've listed the top 11 technical and soft skills required to become a data analyst: Data Visualization Tableau Power BI Data Cleaning MATLAB R Python SQL and NoSQL Machine Learning Linear Algebra and Calculus Microsoft Excel Critical Thinking Communication

https://www.simplilearn.com/what-skills-do-i-need-to-become-a-data-scientist-article Statistical analysis and computing Machine Learning Deep Learning Processing large data sets Data Visualization Tableau Power BI Data Wrangling Mathematics R Python Statistics

In [18]:
# Sukūrkime naujus stulpelius su įgūdžiais, kur Įgūdžio stulpelio eilutėje bus įrašas 1 jeigu yra 'Job Description' stulpelyje, priešingu atveju
# bus įrašytas 0

specialists['Python'] = specialists['Job Description'].apply(lambda x : 1 if 'python' in x.lower() else 0)
specialists['SQL'] = specialists['Job Description'].apply(lambda x : 1 if 'SQL' in x.lower() or 'SQL' in x else 0)
specialists['R'] = specialists['Job Description'].apply(lambda x : 1 if 'R ' in x else 0)
specialists['SAS'] = specialists['Job Description'].apply(lambda x : 1 if 'SAS' in x else 0)
specialists['Java'] = specialists['Job Description'].apply(lambda x : 1 if 'java' in x.lower() else 0)
specialists['SPSS'] = specialists['Job Description'].apply(lambda x : 1 if 'spss' in x.lower() else 0)
specialists['Excel'] = specialists['Job Description'].apply(lambda x : 1 if 'excel' in x.lower() else 0)
specialists['Tableau'] = specialists['Job Description'].apply(lambda x : 1 if 'tableau' in x.lower() else 0)
specialists['PowerBI'] = specialists['Job Description'].apply(lambda x : 1 if 'powerbi' in x.lower() or 'power bi' in x.lower() else 0)
specialists['Machine Learning'] = specialists['Job Description'].apply(lambda x : 1 if 'machine learning' in x.lower() or 'ML' in x else 0)
specialists['Statistics'] = specialists['Job Description'].apply(lambda x : 1 if 'statistics' in x.lower() or 'stats' in x.lower() or 'statistical' in x.lower() else 0)

specialists[['Python','R','SAS','SQL','Java','SPSS','Excel','Tableau','PowerBI', 'Machine Learning', 'Statistics']].head(3)
Out[18]:
Python R SAS SQL Java SPSS Excel Tableau PowerBI Machine Learning Statistics
0 1 0 0 1 0 0 0 0 0 0 1
1 0 0 1 1 0 0 1 0 0 0 1
2 1 0 0 1 0 0 1 1 0 0 1
In [19]:
# Pažiūrėkime kiek radome skirtingų įgūdžių stulpelyje 'Job Description'
skills = {'Python' : specialists['Python'].sum(), 
          'SQL' : specialists['SQL'].sum(), 
          'R' : specialists['R'].sum(), 
          'SAS' : specialists['SAS'].sum(),
          'Java' : specialists['Java'].sum(), 
          'Excel' : specialists['Excel'].sum(), 
          'SPSS' : specialists['SPSS'].sum(), 
          'Tableau' : specialists['Tableau'].sum(), 
          'PowerBI' : specialists['PowerBI'].sum(),
          'Machine Learning' : specialists['Machine Learning'].sum(),
          'Statistics' : specialists['Statistics'].sum()}
skills
Out[19]:
{'Python': 2693,
 'SQL': 3392,
 'R': 1251,
 'SAS': 803,
 'Java': 996,
 'Excel': 3207,
 'SPSS': 294,
 'Tableau': 1292,
 'PowerBI': 534,
 'Machine Learning': 1880,
 'Statistics': 2638}
In [131]:
specialists.groupby('specialist')['Python'].value_counts()
Out[131]:
specialist  Python
Analyst     0         1616
            1          636
Scientist   1         2057
            0         1852
Name: Python, dtype: int64
In [132]:
specialists.groupby('specialist')['SQL'].value_counts()
Out[132]:
specialist  SQL
Analyst     1      1378
            0       874
Scientist   1      2014
            0      1895
Name: SQL, dtype: int64

Stulpelio 'Company Name' modifikacija¶

In [20]:
# Ištrinkime simbolius stulpelyje 'Company Name' nuo '\n' 
specialists['Company Name'] = specialists['Company Name'].apply(lambda x: str(x)[:-4])
specialists.head(3)
Out[20]:
Job Title Salary Estimate Job Description Rating Company Name Location Headquarters Size Founded Type of ownership ... SQL R SAS Java SPSS Excel Tableau PowerBI Machine Learning Statistics
0 Data Analyst, Center on Immigration and Justic... $37K-$66K (Glassdoor est.) Are you eager to roll up your sleeves and harn... 3.2 Vera Institute of Justice New York, NY New York, NY 201 to 500 employees 1961.0 Nonprofit Organization ... 1 0 0 0 0 0 0 0 0 1
1 Quality Data Analyst $37K-$66K (Glassdoor est.) Overview\n\nProvides analytical and technical ... 3.8 Visiting Nurse Service of New York New York, NY New York, NY 10000+ employees 1893.0 Nonprofit Organization ... 1 0 1 0 0 1 0 0 0 1
2 Senior Data Analyst, Insights & Analytics Team... $37K-$66K (Glassdoor est.) We’re looking for a Senior Data Analyst who ha... 3.4 Squarespace New York, NY New York, NY 1001 to 5000 employees 2003.0 Company - Private ... 1 0 0 0 0 1 1 0 0 1

3 rows × 30 columns

Stulpelio 'Location' išskaidymas į miesto pavadinimą ir valstijų/valstybių santrumpą¶

In [23]:
# Ištraukime valstijų/valstybių santrumpą iš stulpelio 'Location', įrašydami jas į naują stulpelį
# Taip pat pažiūrėkime visas unikalias reikšmes
specialists['Location_state'] = specialists['Location'].apply(lambda x : x.split(',')[1])
specialists['Location_state'].unique()
Out[23]:
array([' NY', ' NJ', ' UT', ' CA', ' VA', ' FL', ' IL', ' TX', ' AZ',
       ' PA', ' DE', ' OH', ' NC', ' SC', ' IN', ' WA', ' GA', ' KS',
       ' CO', ' Arapahoe', ' United Kingdom'], dtype=object)
In [24]:
# Pakeiskime 'Arapahoe' iš 'Location_state' į 'CO', 'United Kingdom' į 'UK'
specialists['Location_state'] = specialists['Location_state'].replace('Arapahoe', 'CO', regex=True)
specialists['Location_state'] = specialists['Location_state'].replace('United Kingdom', 'UK', regex=True)
specialists['Location_state'].unique()
Out[24]:
array([' NY', ' NJ', ' UT', ' CA', ' VA', ' FL', ' IL', ' TX', ' AZ',
       ' PA', ' DE', ' OH', ' NC', ' SC', ' IN', ' WA', ' GA', ' KS',
       ' CO', ' UK'], dtype=object)
In [25]:
specialists.head(3)
Out[25]:
Job Title Salary Estimate Job Description Rating Company Name Location Headquarters Size Founded Type of ownership ... R SAS Java SPSS Excel Tableau PowerBI Machine Learning Statistics Location_state
0 Data Analyst, Center on Immigration and Justic... $37K-$66K (Glassdoor est.) Are you eager to roll up your sleeves and harn... 3.2 Vera Institute of Justice New York, NY New York, NY 201 to 500 employees 1961.0 Nonprofit Organization ... 0 0 0 0 0 0 0 0 1 NY
1 Quality Data Analyst $37K-$66K (Glassdoor est.) Overview\n\nProvides analytical and technical ... 3.8 Visiting Nurse Service of New York New York, NY New York, NY 10000+ employees 1893.0 Nonprofit Organization ... 0 1 0 0 1 0 0 0 1 NY
2 Senior Data Analyst, Insights & Analytics Team... $37K-$66K (Glassdoor est.) We’re looking for a Senior Data Analyst who ha... 3.4 Squarespace New York, NY New York, NY 1001 to 5000 employees 2003.0 Company - Private ... 0 0 0 0 1 1 0 0 1 NY

3 rows × 31 columns

In [26]:
# Kadangi jau turime valstijų/valstybių santrumpų atsikrą stulpelį, tai ištrinkime santrumpas stulpelyje 'Location'
specialists['Location'] = specialists['Location'].apply(lambda x: str(x)[:-4])
In [27]:
specialists.head(2)
Out[27]:
Job Title Salary Estimate Job Description Rating Company Name Location Headquarters Size Founded Type of ownership ... R SAS Java SPSS Excel Tableau PowerBI Machine Learning Statistics Location_state
0 Data Analyst, Center on Immigration and Justic... $37K-$66K (Glassdoor est.) Are you eager to roll up your sleeves and harn... 3.2 Vera Institute of Justice New York New York, NY 201 to 500 employees 1961.0 Nonprofit Organization ... 0 0 0 0 0 0 0 0 1 NY
1 Quality Data Analyst $37K-$66K (Glassdoor est.) Overview\n\nProvides analytical and technical ... 3.8 Visiting Nurse Service of New York New York New York, NY 10000+ employees 1893.0 Nonprofit Organization ... 0 1 0 0 1 0 0 0 1 NY

2 rows × 31 columns

Stulpelių šalinimas¶

In [28]:
# Patikrinkime kiek kiekviename lentelės stulpelyje yra NaN reikšmių
specialists.isna().sum()
Out[28]:
Job Title               0
Salary Estimate         0
Job Description         0
Rating                681
Company Name            0
Location                0
Headquarters          412
Size                  392
Founded              1637
Type of ownership     392
Industry              899
Sector                899
Revenue               392
Competitors          4491
Easy Apply           5917
specialist              0
salary_min              0
salary_max              0
salary_avg              0
Python                  0
SQL                     0
R                       0
SAS                     0
Java                    0
SPSS                    0
Excel                   0
Tableau                 0
PowerBI                 0
Machine Learning        0
Statistics              0
Location_state          0
dtype: int64
In [30]:
# Ištriname stulpelius, kuriuose trūksta daug reikšmių, bei kurių nenaudosime analizėje
specialists = specialists.drop(['Salary Estimate', 'Founded', 'Competitors', 'Easy Apply'], axis = 1)
specialists.head(3)
Out[30]:
Job Title Job Description Rating Company Name Location Headquarters Size Type of ownership Industry Sector ... R SAS Java SPSS Excel Tableau PowerBI Machine Learning Statistics Location_state
0 Data Analyst, Center on Immigration and Justic... Are you eager to roll up your sleeves and harn... 3.2 Vera Institute of Justice New York New York, NY 201 to 500 employees Nonprofit Organization Social Assistance Non-Profit ... 0 0 0 0 0 0 0 0 1 NY
1 Quality Data Analyst Overview\n\nProvides analytical and technical ... 3.8 Visiting Nurse Service of New York New York New York, NY 10000+ employees Nonprofit Organization Health Care Services & Hospitals Health Care ... 0 1 0 0 1 0 0 0 1 NY
2 Senior Data Analyst, Insights & Analytics Team... We’re looking for a Senior Data Analyst who ha... 3.4 Squarespace New York New York, NY 1001 to 5000 employees Company - Private Internet Information Technology ... 0 0 0 0 1 1 0 0 1 NY

3 rows × 27 columns

In [31]:
specialists.shape
Out[31]:
(6161, 27)

Duomenų analizė¶

Atlyginimų skirtumai tarp specialistų¶

In [32]:
specialists[['salary_avg']].describe()
Out[32]:
salary_avg
count 6161.000000
mean 94.802143
std 38.040594
min 18.000000
25% 64.000000
50% 85.500000
75% 122.000000
max 225.000000
In [33]:
# Atlyginimų pasiskirstymas (Histograma) imant abį specialybes

plt.figure(figsize=(4,2))
axes = sns.histplot(x='salary_avg', data = specialists, bins = 20, kde=True)
axes.axvline(specialists['salary_avg'].mean(), color='r', label = 'mean')
axes.axvline(specialists['salary_avg'].median(), color = 'g', label = 'median')
plt.xlabel('Average Salary')
axes.legend()

plt.show()
In [34]:
# Dabar pažiūrėkime atskirai pagal specialybes atlyginimų pasiskirstymus
specialists.groupby('specialist')[['salary_avg']].describe().transpose()
Out[34]:
specialist Analyst Scientist
salary_avg count 2252.000000 3909.000000
mean 72.123002 107.867741
std 23.600734 38.639390
min 33.500000 18.000000
25% 58.000000 73.000000
50% 69.000000 104.500000
75% 80.500000 133.000000
max 150.000000 225.000000
In [42]:
specialists['specialist'].value_counts()
Out[42]:
Scientist    3909
Analyst      2252
Name: specialist, dtype: int64
In [50]:
# Data analyst
plt.figure(figsize=(4,2))
axes_a = sns.histplot(x='salary_avg', data = specialists.loc[:2251], bins = 20, kde=True)
axes_a.axvline(specialists['salary_avg'].mean(), color='r', label = 'mean')
axes_a.axvline(specialists['salary_avg'].median(), color = 'g', label = 'median')
plt.xlabel('Analyst_Average Salary')
axes_a.legend()

# Data scientist
plt.figure(figsize=(4,2))
axes_s = sns.histplot(x='salary_avg', data = specialists.loc[2252:], bins = 20, kde=True)
axes_s.axvline(specialists['salary_avg'].mean(), color='r', label = 'mean')
axes_s.axvline(specialists['salary_avg'].median(), color = 'g', label = 'median')
plt.xlabel('Scientist_Average Salary')
axes_s.legend()

plt.show()

Iš gautų grafikų matome, reikšmių pasiskirstymai išsikreipę į kairę pusę, t.y. dominuoja mažesni už vidurkį atlyginimai labiau analitiko pozicijoje, negu mokslininko

Įgūdžių skirtumai/panašumai tarp specialistų¶

In [62]:
# Pažiūrėkime kokius įgūžius ir jų kiekius turime pagal specialistus atskirai
specialists.groupby('specialist')[['Python','R','SAS','SQL','Java','SPSS','Excel','Tableau','PowerBI', 'Machine Learning', 'Statistics']].sum()
Out[62]:
Python R SAS SQL Java SPSS Excel Tableau PowerBI Machine Learning Statistics
specialist
Analyst 636 400 359 1378 204 154 1353 619 248 328 844
Scientist 2057 851 444 2014 792 140 1854 673 286 1552 1794

Iš gautos lentelės skaičių daryti igūdžių palyginimą tarp specialistų būtų netikslinga, kadangi darbo skelbimų duomenų mokslininko pozicijai yra daugiau, negu duomenų analitiko. Taigi išrikiuokime mažėjimo tvarka turimus įgūdžius ir pažiūrėkime TOP5 kiekvienoje specialybėje atskirai ir palyginkime

In [64]:
# Analyst
skills_a=pd.DataFrame(specialists[specialists['specialist']=='Analyst'][['Python','R','SAS','SQL','Java','SPSS','Excel','Tableau','PowerBI', 'Machine Learning', 'Statistics']].sum().sort_values(ascending=False))
skills_a.rename(columns={0:'count_a'}, inplace=True)
skills_a['count_a_proc'] = skills_a['count_a']/skills_a['count_a'].sum()*100
skills_a=skills_a.reset_index()
skills_a.rename(columns={'index':'skill'}, inplace=True)
skills_a
Out[64]:
skill count_a count_a_proc
0 SQL 1378 21.125249
1 Excel 1353 20.741990
2 Statistics 844 12.938832
3 Python 636 9.750115
4 Tableau 619 9.489499
5 R 400 6.132148
6 SAS 359 5.503603
7 Machine Learning 328 5.028361
8 PowerBI 248 3.801932
9 Java 204 3.127395
10 SPSS 154 2.360877
In [66]:
# Scientist
skills_s=pd.DataFrame(specialists[specialists['specialist']=='Scientist'][['Python','R','SAS','SQL','Java','SPSS','Excel','Tableau','PowerBI', 'Machine Learning', 'Statistics']].sum().sort_values(ascending = False))
skills_s.rename(columns={0:'count_s'}, inplace=True)
skills_s['count_s_proc'] = skills_s['count_s']/skills_s['count_s'].sum()*100
skills_s=skills_s.reset_index()
skills_s.rename(columns={'index':'skill'}, inplace=True)
skills_s
Out[66]:
skill count_s count_s_proc
0 Python 2057 16.512804
1 SQL 2014 16.167617
2 Excel 1854 14.883198
3 Statistics 1794 14.401541
4 Machine Learning 1552 12.458858
5 R 851 6.831500
6 Java 792 6.357871
7 Tableau 673 5.402585
8 SAS 444 3.564261
9 PowerBI 286 2.295898
10 SPSS 140 1.123866
In [163]:
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(16, 4))
skills_a.plot.bar(x='skill', y='count_a', title='Data analyst skills', color='orange',ax=axs[0]);
skills_s.plot.bar(x='skill', y='count_s', title='Data scientist skills',ax=axs[1]);

Darbo sektoriai, atlyginimai. TOP Įmonės, miestai¶

In [147]:
# Pažiūrėkime kokiuose darbo sektoriuose reikalingi specialistai
specialists_sectors = specialists['Sector'].value_counts().dropna().sort_values(ascending=False).head(n=10)
fig, axs = plt.subplots(nrows=1, ncols=1, figsize=(5, 5))
specialists_sectors.plot.pie(autopct='%1.1f%%', ylabel="");
In [153]:
# Pažiūrėkime kokiuose darbo sektoriuose reikalingi specialistai atskirai
specialists_sectors_s = specialists[specialists['specialist'] == 'Scientist']['Sector'].value_counts().dropna().sort_values(ascending=False).head(n=10)
specialists_sectors_a = specialists[specialists['specialist'] == 'Analyst']['Sector'].value_counts().dropna().sort_values(ascending=False).head(n=10)

fig, axs = plt.subplots(nrows=1, ncols=1, figsize=(5, 5))
specialists_sectors_a.plot.pie(autopct='%1.1f%%', ylabel="", title='Data analyst/Sectors');
fig, axs = plt.subplots(nrows=1, ncols=1, figsize=(5, 5))
specialists_sectors_s.plot.pie(autopct='%1.1f%%', ylabel="", title='Data scientist/Sectors');
In [260]:
# Pažiūrėkime kokie atlyginimai svyruoja daugiausią darbo pasiūlymų turinčiuose sektoriuose
analyst = specialists[specialists['specialist'] == 'Analyst']
scientist = specialists[specialists['specialist'] == 'Scientist']

plt.figure(figsize = (8,4))
sns.boxplot(x = 'salary_avg', y = 'Sector', data = analyst, order = ['Information Technology', 'Business Services', 'Biotech & Pharmaceuticals','Finance', 'Health Care'], palette = 'GnBu_d')
plt.xlabel('Average Salary')
plt.ylabel('Sector')
plt.title('Data Analyst/Average Salary by Sector', fontsize = 12)

plt.figure(figsize = (8,4))
sns.boxplot(x = 'salary_avg', y = 'Sector', data = scientist, order = ['Information Technology', 'Business Services', 'Biotech & Pharmaceuticals','Finance', 'Health Care'], palette = 'GnBu_d')
plt.xlabel('Average Salary')
plt.ylabel('Sector')
plt.title('Data Scientist/Average Salary by Sector', fontsize = 12)

plt.show()
In [180]:
# Išrinkime TOP10 miestų ir TOP10 įmonių kuriuose yra siūlomos abi darbo pozicijos
specialists_locations = specialists['Location'].value_counts().sort_values(ascending=False).head(n=10)
specialists_companies = specialists['Company Name'].value_counts().sort_values(ascending=False).head(n=10)

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(16, 4))
specialists_locations.plot.bar(ylabel="No. of jobs", xlabel="", color='orange', ax=axs[0], title='TOP10 Locations');
specialists_companies.plot.bar(ylabel="No. of jobs", xlabel="", ax=axs[1], title='TOP10 Companies');
In [177]:
# Išrinkime TOP10 miestų ir TOP10 įmonių / Data analyst
specialists_locations_a = specialists[specialists['specialist'] == 'Analyst']['Location'].value_counts().sort_values(ascending=False).head(n=10)
specialists_companies_a = specialists[specialists['specialist'] == 'Analyst']['Company Name'].value_counts().sort_values(ascending=False).head(n=10)

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(16, 4))
specialists_locations_a.plot.bar(ylabel="No. of jobs", xlabel="", color='orange', ax=axs[0], title='TOP10 Data analyst/Locations');
specialists_companies_a.plot.bar(ylabel="No. of jobs", xlabel="", ax=axs[1], title='TOP10 Data analyst/Companies');

# Išrinkime TOP10 miestų ir TOP10 įmonių / Data scientist
specialists_locations_s = specialists[specialists['specialist'] == 'Scientist']['Location'].value_counts().sort_values(ascending=False).head(n=10)
specialists_companies_s = specialists[specialists['specialist'] == 'Scientist']['Company Name'].value_counts().sort_values(ascending=False).head(n=10)

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(16, 4))
specialists_locations_s.plot.bar(ylabel="No. of jobs", xlabel="", color='orange', ax=axs[0], title='TOP10 Data scientist/Locations');
specialists_companies_s.plot.bar(ylabel="No. of jobs", xlabel="", ax=axs[1], title='TOP10 Data scientist/Companies');
In [193]:
# Kadangi 'Job Description' stulpelio eilutėse yra labai daug teksto, mes jį pašalinsime, prieš įkeliant į Tableau.
final=specialists.drop(columns=['Job Description'])
# final
In [194]:
import os
# os.getcwd()
os.makedirs('baigiamasis', exist_ok=True)  
final.to_csv('baigiamasis/final_work.csv')

Išvados¶

Pagal iškeltus darbe tikslus, galime padaryti sekančias išvadas:
    1. Pagal Glassdoor darbo skelbimus duomenų mokslininkų vidutinis atlyginimas (107K) yra gerokai didesnis už duomenų analitiko (72K).
    2. Imant TOP5 reikalaujamų kompiuterinių įgūdžių specialistams, pamatėme, kad abiems dominuoja praktiškai vienodi.
Duomenų mokslininkui - 'Python','SQL','Excel', 'Statistics', 'Machine Learning', 
o duomenų analitikui - 'SQL', 'Excel', 'Statistics', 'Python','Tableau'.
    3. Populiariausi darbo sektoriai Glassdoor puslapyje abiems specialistams yra 'Information Technology', 'Business services'ir 'Finance', o sektoriuje 'Biotech & Pharmaceuticals' jau yra reikalingi labiau duomenų mokslininkai.
     Kas liečia atlyginimus pagal darbo sektorius, tai duomenų analitiko pozicijai geriasią atlygį siūlo 'Biotech & Pharmaceuticals' sektorius, o duomenų mokslininkui 'Information Technology'.
     Didžiausia paklausa duomenų mokslininkų yra 'Apple', 'IBM' ir 'Amazon' įmonėse, tuo tarpu duomenų analitikų - 'Staffigo Technical Services' LLC, 'Diverse Lynx', Lorven Technologies Inc.

In [ ]: